Abstract —A heterogenous network with base stations (BSs),
small base stations (SBSs) and users distributed according to
independent Poisson point processes is considered. SBS nodes
are assumed to possess high storage capacity and to form a
distributed caching network. Popular files are stored in local
caches of SBSs, so that a user can download the desired files from
one of the SBSs in its vicinity. The offloading loss is captured via
a cost function that depends on the random caching strategy
proposed here. The popularity profile of cached content is
unknown and estimated using instantaneous demands from users
within a specified time interval. An estimate of the cost function
is obtained from which an optimal random caching strategy is
devised. The training time to achieve an $\epsilon > 0$ difference between
the achieved and optimal costs is finite provided the user density
is greater than a predefined threshold, and scales as $N^2$, where
$N$ is the support of the popularity profile. A transfer learning-based
approach to improve this estimate is proposed. The training
time is reduced when the popularity profile is modeled using a
parametric family of distributions; the delay is independent of
$N$ and scales linearly with the dimension of the distribution
parameter.

Index Terms —Caching; small cell networks; popularity profile; 
transfer learning. 


I. Introduction 

The advent of multimedia-capable devices at economical 
costs has triggered the growth of wireless data traffic at an 
unprecedented rate. This trend is likely to continue, requiring 
wireless service providers to reevaluate design strategies for 
the next generation wireless infrastructure m. A promising 
approach to address this problem is to deploy small cells 
that can offload a significant amount of data from a macro 
base station (BS) 0. Doing so, it is expected to lead to 
cost-effective integration of the existing WiFi and cellular 
technologies with improved performance of peak data traffic 
steering policies 13. However, a potential shortcoming of the 
small cell infrastructure is that, during peak traffic hours, the 
backhaul link-capacity requirement to support data traffic is 
enormously high 11. Also, the cost incurred in deploying 
a high capacity backbone network for small cells can be 
quite high. Therefore, small cell-based solutions alone will not 
suffice to efficiently solve the quality of service requirements 
associated with peak traffic demands. 

A noteworthy development in this direction is to improve 
the accessibility of data content to users by storing the most 
popular data files in the local caches (intermediate servers 
such as gateways, routers, etc.) of small cell BSs, with the
objective of reducing the peak traffic rates. This is commonly 
referred to as “caching” and has attracted significant attention 
0 - ii. In the next subsection we mention a few references, 
which although by no means exhaustive, fairly indicate the 
scope and trend of research on caching. 

A. Literature review on caching 

Caching has received considerable attention in the wireless 
communications literature. In El, a two-layer hierarchical 
strategy termed New Snoop was proposed to cache the
unacknowledged packets from mobiles and BSs to significantly
enhance TCP performance. In ifTOl . a technique based on the 
concept of content-centric networking was devised for caching 
in 5G networks, while in ca caching of video files was 
proposed by exploiting the redundancy of user requests and 
storage capacity of mobile devices with a priori knowledge 
of the locations of devices. In Ha, the effects of cache size 
and cached-data popularity on a data access scheme were 
studied to mitigate the traffic load over the wireless channel. 
In d, inner and outer bounds were proposed for the joint 
routing and caching problem in small cell networks, while 
in HI in-network caching was proposed for an information-centric
networking architecture for faster content distribution
in an energy-efficient manner. In-network caching was
employed in ifTSl for content-centric networks using a tool called
“contrace” for monitoring and operating the network. The 
tradeoff between the performance gain of coded caching and 
delivery delay in video streaming was characterized in ifTbll . 
A polynomial-time heuristic solution was proposed in ifTTll to 
address the NP-hard optimization problem of maximizing the 
caching utility of mobile users. 

Caching has also made advances in device-to-device (D2D) 
communications. In na, a practical method was devised for 
data caching and content distribution in D2D networks to 
enhance assisted communications between proximate nodes. 
In HD, the outage-throughput tradeoff was characterized 
for D2D nodes, which obtained the desired file from nodes 
which had that file in their caches. In [20], the conflict between
collaboration-distance and interference was identified among
D2D nodes to maximize frequency reuse by exploiting
distributed storage of cached content. In ETIl, coded caching was
shown to achieve multicast gain in a D2D network, where
users had access to linear combinations of packets from cached
files. In [22], the throughput scaling laws of random caching,
where users with pre-cached information made arbitrary
requests for cached files, were studied. New caching mechanisms
developed by modeling the network as independent Poisson
point processes (PPPs) with full knowledge of the popularity
profile can be found in ll2^ - [26], while the most recent results
on caching in D2D networks and video content delivery are
reported in lIZTll and [28].

Caching has been addressed from an information-theoretic 
viewpoint as well. In ll2^ . it was shown that when cached- 
content demand is uniformly distributed, joint optimization of 
caching and coded multicast delivery significantly improves 
the gains; this setup was extended to the case of nonuniform 
distributions on demand and to a decentralized setting in ll^ 
and ED, respectively. In E^ . coded caching was achieved 
for content delivery networks with two layers of caches. 

B. Main contributions of this paper 

In the aforementioned references, the popularity profile of 
data files was assumed to be known perfectly. In practice, 
such an assumption cannot be reasonably justified; this was 
clearly highlighted in m - E21, where various learning- 
based approaches were employed to estimate the popularity 
profile. On the other hand, estimation procedures result in
computational overhead, especially in data-intensive real-time
multimedia applications. Therefore, given the increasing
demand for improving the quality of service for the end users,
establishing the theoretical underpinnings of learning-based 
caching strategies is a topical research problem, and is the 
main subject of this paper. 

In this work, we relax the assumption of a priori knowledge 
of the popularity profile to devise a caching strategy. We 
consider a heterogenous network where the users, BS and 
small base stations (SBSs) are assumed to be distributed 
according to PPPs. Each SBS is assumed to employ a random 
caching strategy with no caching at the user terminal (see 
mi). A protocol model for communications is proposed using 
which a cost that captures backhaul link overhead that depends 
on the popularity profile is derived. Assuming a Poisson 
request model, a centralized approach is presented in which 
the BS computes an estimate of the popularity profile based 
on the requests observed during the time interval $[0,\tau]$; this
estimate is then used in the cost function to optimize the
caching probability. Thus, the actual cost incurred differs from
the optimal cost, and this difference depends on the number
of samples used to estimate the popularity profile. Further,
the number of samples collected at the BS depends on the
density of the Poisson arrival process and the training time
during which the samples are collected. A lower bound on this
training time is derived that guarantees a cost that is within
$\epsilon > 0$ of the optimal cost. The results are improved using a
transfer learning (TL)-based approach wherein samples from
other domains, such as those obtained from a social network,
are used to improve the estimation accuracy; the minimum
number of source domain samples required to achieve better
performance is derived. Finally, we model the popularity
profile using a parametric family of distributions (specifically,
the Zipf distribution ESl l) to analyze the benefits offered.

The following are the main findings of our study: 

(i) The training time $\tau$ is finite, provided the user density is
greater than a predefined threshold.

(ii) $\tau$ scales as $N^2 \log N$, where $N$ is the total number of
cached data files in the system.

(iii) Employing the TL-based approach, a finite training time
can be achieved for all user densities. In this case, the
training time is a function of the “distance” between the
probability distribution of the files requested and that of
the source domain samples (the notion of distance will
be made precise in the proof of Theorem 3).

(iv) When the popularity profile is modeled using a parametric
family of distributions, the bound on the training
time is independent of $N$, and scales only linearly with
the dimension of the distribution parameter, leading to a
significant improvement in the performance compared to
its nonparametric counterpart.

The problem of periodic caching without the knowledge of
the popularity profile, but with access to the demand history,
was addressed in ll^ and lIMl; however, the model and
objective function considered in our work are different from
those presented therein. Learning-based approaches to estimate
the popularity profile for devising caching mechanisms have
also been reported in ES - ED, while caching in femtocell
networks without prior knowledge of the popularity distribution
was considered in [39], where it was shown that distributed
caching was NP-hard and approximation algorithms
were proposed for video content delivery. We would like
to emphasize that the central focus of this paper is not on
deriving new caching mechanisms. Our main contribution is
the theoretical analysis of the implications of learning the
popularity profile on the training time to achieve an offloading
loss that is within $\epsilon > 0$ of that of the optimal policy. To the
best of our knowledge, this is the first instance where an
analytical treatment of training time and its relation to the
probability distribution function of source domain samples has
been reported in the literature on caching. Some preliminary
aspects of this work can be found in [40].

In Section II, we present the system model followed by the
main problem addressed in the paper. The two methods for
estimating the popularity profile and the corresponding training
time analyses are developed in Section III. The training time
analysis when the popularity profile is modeled as a parametric
family of distributions is presented in Section IV. Numerical
results are reported in Section V. Concluding remarks are
provided in Section VI. The proofs of the theorems are
relegated to appendices.

II. System Model and Problem Statement 

In this section, we present the system model followed by
the main problem addressed in the paper. The notation used
in the rest of the paper is as follows: $\Phi_u$, $\Phi_s$ and $\Phi_b$
($\lambda_u$, $\lambda_s$ and $\lambda_b$) denote the points (densities)
corresponding to the users, SBSs and BSs, respectively; $k_x$ denotes
the number of requests in $[0,\tau]$ by the user at $x$; $X_x^{(l)}$ denotes
the $l$-th request of the user at $x$; $\lambda_r$ is the average number of
requests per unit time. A heterogenous cellular network is considered
where the set $\Phi_u \subseteq \mathbb{R}^2$ of users, the set
$\Phi_b \subseteq \mathbb{R}^2$ of BSs, and the set
$\Phi_s \subseteq \mathbb{R}^2$ of SBSs are distributed according to
independent PPPs with densities $\lambda_u$, $\lambda_b$ and $\lambda_s$,
respectively, in the two-dimensional space $\mathbb{R}^2$. Each user
independently requests a data-file of size $B$ bits from the set
$\mathcal{F} = \{f_1, f_2, \ldots, f_N\}$; the popularity of data files is
specified by the distribution $\mathcal{P} = \{p_1, \ldots, p_N\}$, where
$\sum_{i=1}^{N} p_i = 1$, and is assumed to
be stationary across time. In a typical heterogenous cellular
network, the BS fetches a file using its backhaul link to 
serve a user. During peak data traffic hours, this results in an 
information-bottleneck both at the BS as well as in its backhaul 
link. To alleviate this problem, caching the most popular files 
(either at the user nodes or at SBSs) is proposed. The requested 
file will be served directly by one of the neighboring SBSs 
depending on the availability of the file in its local cache. The 
performance of caching depends on the density of SBS nodes, 
cache size, users’ request rate, and the caching strategy. It is
assumed that the SBS can cache up to $M$ files, each of length $B$
bits. Each SBS in $\Phi_s$ caches its content in an independent and
identically distributed (i.i.d.) fashion by generating $M$ indices
distributed according to $\Pi = \{\pi_i : f_i \in \mathcal{F}, i = 1,2,\ldots,N\}$,
$\sum_{i=1}^{N} \pi_i = 1$ (see [19]). One way of generating this is to
roll an $N$-sided die $M$ times in an i.i.d. fashion, where the
outcomes correspond to the index of the file to be cached.
Although this approach is suboptimal, it is mathematically
tractable and the corresponding time complexity serves as a
lower bound, albeit pessimistic, for optimal strategies.
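
As a minimal sketch of the die-rolling procedure described above, the following Python snippet draws $M$ i.i.d. file indices from $\Pi$ and evaluates the probability $1 - (1 - \pi_i)^M$ that a given file ends up in a cache. The function names `sample_cache` and `prob_file_cached`, the 0-based file indices and the example distribution are illustrative assumptions and are not taken from the paper.

```python
import random

def sample_cache(pi, M, rng):
    """Draw M i.i.d. file indices (0-based) from the caching distribution pi."""
    if abs(sum(pi) - 1.0) > 1e-9 or min(pi) < 0:
        raise ValueError("pi must be a probability vector")
    # Rolling an N-sided die M times: duplicates are possible, so the
    # number of distinct cached files can be smaller than M.
    draws = rng.choices(range(len(pi)), weights=pi, k=M)
    return set(draws)

def prob_file_cached(pi_i, M):
    """Probability that a single SBS holds file i after M i.i.d. draws."""
    return 1.0 - (1.0 - pi_i) ** M

rng = random.Random(0)          # fixed seed, for reproducibility only
pi = [0.5, 0.3, 0.2]            # hypothetical caching distribution, N = 3
cache = sample_cache(pi, M=2, rng=rng)
print(cache, prob_file_cached(pi[0], M=2))
```

Because the draws are made with replacement, a cache of nominal size $M$ may hold fewer than $M$ distinct files, which is one reason the strategy is suboptimal yet easy to analyze.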

We now present a simple communications protocol to
determine the set of neighboring SBS nodes for any user in $\Phi_u$.
Essentially, we let each SBS at location $y \in \Phi_s$ communicate
with a user at location $x \in \Phi_u$ if $\|y - x\| < \gamma$ $(\gamma > 0)$;
this condition determines the communication radius. In this
protocol, we have ignored the interference constraint. The set
of neighbors of the user at location $x$ is denoted

$$\mathcal{N}_x \triangleq \{y \in \Phi_s : \|y - x\| < \gamma\}. \tag{1}$$
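
The neighbor set in (1) can be illustrated with a short simulation. The sketch below, under stated assumptions, samples a homogeneous PPP on a finite square window (an approximation of the infinite plane) and returns the SBSs strictly within distance $\gamma$ of a user. The helper names `poisson`, `sample_ppp` and `neighbors` and all numerical values are hypothetical.

```python
import math
import random

def poisson(mean, rng):
    """Poisson variate: count unit-rate exponential arrivals in [0, mean)."""
    k, t = 0, rng.expovariate(1.0)
    while t < mean:
        k += 1
        t += rng.expovariate(1.0)
    return k

def sample_ppp(density, half_width, rng):
    """Homogeneous PPP restricted to the square [-half_width, half_width]^2."""
    area = (2.0 * half_width) ** 2
    n = poisson(density * area, rng)
    return [(rng.uniform(-half_width, half_width),
             rng.uniform(-half_width, half_width)) for _ in range(n)]

def neighbors(user, sbs_points, gamma):
    """SBSs strictly within distance gamma of the user, as in (1)."""
    ux, uy = user
    return [(x, y) for (x, y) in sbs_points
            if math.hypot(x - ux, y - uy) < gamma]

rng = random.Random(1)
sbs = sample_ppp(density=1e-3, half_width=200.0, rng=rng)  # illustrative values
near = neighbors((0.0, 0.0), sbs, gamma=100.0)
print(len(sbs), len(near))
```

The window half-width is chosen larger than $\gamma$ so that the neighborhood of the user at the origin is not truncated by the window boundary.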


A. The main problem addressed in this paper 

The user located at $x \in \Phi_u$ requests a data-file from the
set $\mathcal{F}$ according to the popularity profile $\mathcal{P}$.
The requested file will be served
directly by a neighboring SBS at location $y \in \mathcal{N}_x$ depending
on the availability of the file in its local cache, and following
the protocol described in the previous paragraph. The problem
of caching involves minimizing the time overhead incurred
due to the unavailability of the requested file. Without loss
of generality and for ease of analysis, we focus on the
performance of a typical user located at the origin, denoted
by $o \in \Phi_u$. The unavailability of the requested file from a
user located at $o$ is given by

$$\mathcal{T}(\Pi, \mathcal{P}) \triangleq \frac{B}{R_0}\, \mathbb{E} \sum_{i=1}^{N} \mathbf{1}\{f_i \notin \mathcal{N}_o\}\, \mathbf{1}\{f_i \text{ requested}\}, \tag{2}$$

where $\mathcal{N}_o$ is as defined in (1), $R_0$ is the rate supported by
the BS to the user, and $B/R_0$ is the time overhead incurred in
transmitting the file from the BS to the user. Further, we use
$f_i \notin \mathcal{N}_o$ to denote the event that the file $f_i$ is not stored in
any of the SBSs in $\mathcal{N}_o$. The expectation is with respect to
the random network and cache realizations and $\mathcal{P}$. The indicator
function $\mathbf{1}\{A\}$ is equal to one if the
event $A$ occurs, and zero otherwise. We refer to $\mathcal{T}(\Pi, \mathcal{P})$ as
the “offloading loss”, which we seek to minimize:


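$$\min_{\Pi \succeq 0} \ \mathcal{T}(\Pi, \mathcal{P}) \quad \text{subject to} \quad \sum_{i=1}^{N} \pi_i = 1, \tag{3}$$

where $\pi_i \geq 0$, for $i = 1, \ldots, N$. To solve the optimization
problem (3), we need an analytical expression for $\mathcal{T}(\Pi, \mathcal{P})$,
which is provided in the following theorem.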

Theorem 1: For the caching strategy proposed in this paper,
the average offloading loss is given by

$$\mathcal{T}(\Pi, \mathcal{P}) = \frac{B}{R_0} \left[ \sum_{i=1}^{N} \exp\{-\lambda_s \pi \gamma^2 [1 - (1 - \pi_i)^M]\}\, p_i \right]. \tag{4}$$

Proof: See Appendix A. ■
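
To make (4) concrete, the following sketch evaluates the average offloading loss for a given caching distribution and popularity profile. The function names `miss_probability` and `offloading_loss` and the numerical values are illustrative assumptions; only the formula itself comes from Theorem 1.

```python
import math

def miss_probability(pi_i, lam_s, gamma, M):
    """exp{-lam_s * pi * gamma^2 * [1 - (1 - pi_i)^M]}: no neighbor caches file i."""
    return math.exp(-lam_s * math.pi * gamma ** 2 * (1.0 - (1.0 - pi_i) ** M))

def offloading_loss(pi, p, B, R0, lam_s, gamma, M):
    """Average offloading loss (4): (B / R0) * sum_i miss_probability(pi_i) * p_i."""
    return (B / R0) * sum(miss_probability(a, lam_s, gamma, M) * b
                          for a, b in zip(pi, p))

# Hypothetical example: three files, compare two caching distributions.
p = [0.6, 0.3, 0.1]
params = dict(B=1e6, R0=1e6, lam_s=1e-4, gamma=100.0, M=2)
loss_uniform = offloading_loss([1 / 3, 1 / 3, 1 / 3], p, **params)
loss_matched = offloading_loss(p, p, **params)
print(loss_uniform, loss_matched)
```

Each term of the sum is the probability that file $f_i$ is requested and that none of the Poisson-distributed neighbors of the typical user holds it, so the loss equals $B/R_0$ when there are no SBSs and decreases as the SBS density grows.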

We note that solving the optimization problem posed in
(3) is not the main focus of this paper. We assume that
there exists a method to solve the problem posed in (3), and
instead focus on analyzing the training time required to obtain
a good estimate of the popularity profile that results in an
offloading loss that is within $\epsilon$ of the optimal offloading loss.
Interestingly, although the problem in (3) is non-convex, since
it is separable, a bound on the duality gap can be obtained with
respect to the solution derived using the Karush-Kuhn-Tucker
conditions.

In practice, the popularity profile $\mathcal{P}$ is generally unknown
and has to be estimated. Denoting the estimated popularity
profile by $\hat{\mathcal{P}} = \{\hat{p}_1, \ldots, \hat{p}_N\}$, and the
corresponding offloading loss by $\mathcal{T}(\Pi, \hat{\mathcal{P}})$, (3) becomes

$$\min_{\Pi \succeq 0} \ \mathcal{T}(\Pi, \hat{\mathcal{P}}) \quad \text{subject to} \quad \sum_{i=1}^{N} \pi_i = 1, \tag{5}$$

with $\pi_i \geq 0$, for $i = 1, \ldots, N$. Naturally, the solution to
(5) differs from that of the original problem (3). Let $\hat{\Pi}^*$ and
$\Pi^*$ denote the optimal solutions to the problems in (5) and
(3), respectively, and let the offloading loss achieved using $\hat{\Pi}^*$ be
denoted $\hat{\mathcal{T}}^* \triangleq \mathcal{T}(\hat{\Pi}^*, \mathcal{P})$. The central theme of this paper
is the analysis of the offloading loss difference, i.e., $\hat{\mathcal{T}}^* - \mathcal{T}^*$,
where $\mathcal{T}^* \triangleq \mathcal{T}(\Pi^*, \mathcal{P})$ is the minimum offloading loss
incurred with perfect knowledge of the popularity profile
$\mathcal{P}$. Theorems 2 - 6 are devoted to this analysis.


III. Estimating the popularity profile 

In this section, we present two methods for estimating the
popularity profile and provide the corresponding training time
analyses. The efficiency of the estimate $\hat{\mathcal{P}}$ of the popularity
profile depends on the number of available data samples,
which in turn is related to the number of requests made by
the users. We first obtain an expression for the estimate of
the popularity profile. We then study, in Section III-A, the
minimum training time in obtaining the samples to achieve a
desired estimation accuracy $\epsilon > 0$. Finally, in Section III-B,
we employ the TL-based approach to improve the bound on
the training time. We begin with the definition of the request
model.








Definition 1: (Request Model) Each user requests a file
$f \in \mathcal{F}$ at a random time $t \in [0, \infty]$ following an independent
Poisson arrival process with density $\lambda_r > 0$.

For notational convenience, the same density is assumed
across all the users. The following centralized scheme is
used where the BS collects the requests from all the users
in its coverage area in a time interval $[0, \tau]$ to estimate the
popularity profile of the requested files. Let the number of
users in the coverage area of BS $z \in \Phi_b$ of radius $R > 0$
be $n_R$; since the users are distributed according to a PPP with density
$\lambda_u$, $n_R$ is Poisson distributed with mean $\lambda_u \pi R^2$.
Let the number of requests made by the user at the
location $x \in \{\Phi_u \cap \mathbb{B}(0, R)\}$ in the time interval $[0, \tau]$ be
$k_x$, where $\mathbb{B}(0, R)$ is a two-dimensional ball of radius $R$
centered at 0. We assume that requests across the users are
known at the BS. The requests from the user at $x$ are denoted
$\mathcal{X}_x = \{X_x^{(1)}, \ldots, X_x^{(k_x)}\}$, where $X_x^{(l)} \in \{1, \ldots, N\}$ denotes
the index of the requested file in $\mathcal{F}$, $l = 1, \ldots, k_x$. After receiving
$\mathcal{X}_x$, $x \in \{\Phi_u \cap \mathbb{B}(0, R)\}$, in the time interval $[0, \tau]$, the BS
computes an estimate of the popularity profile as follows:

$$\hat{p}_i = \frac{1}{\sum_{x \in \{\Phi_u \cap \mathbb{B}(0,R)\}} k_x} \sum_{x \in \{\Phi_u \cap \mathbb{B}(0,R)\}} \sum_{l=1}^{k_x} \mathbf{1}\{X_x^{(l)} = i\}, \tag{6}$$

$i = 1, \ldots, N$. Given the number $n_R$ of users in the coverage
area of the BS, the sum $\sum_{x \in \{\mathbb{B}(0,R) \cap \Phi_u\}} k_x$ is a Poisson
distributed random variable with density $n_R \lambda_r$ (that is, with mean
$n_R \lambda_r \tau$ over $[0, \tau]$). Also,
$\mathbb{E}\{\hat{p}_i \mid |\{\Phi_u \cap \mathbb{B}(0, R)\}| = n_R\} = p_i$,
which leads us to conclude that $\hat{p}_i$ is an unbiased estimator.
The estimated popularity profile $\hat{p}_i$ given by (6) is shared with
every SBS in the coverage area of the BS, and is then used in
(5) to find the optimal caching probability.
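
The request model and the estimator in (6) can be sketched as a small simulation. The code below is a hedged illustration: the helper names (`poisson`, `simulate_requests`, `estimate_popularity`), the 0-based file indices and every numerical value are assumptions chosen for the example and do not come from the paper.

```python
import math
import random

def poisson(mean, rng):
    """Poisson variate: count unit-rate exponential arrivals in [0, mean)."""
    k, t = 0, rng.expovariate(1.0)
    while t < mean:
        k += 1
        t += rng.expovariate(1.0)
    return k

def simulate_requests(p, lam_u, lam_r, R, tau, rng):
    """One list of requested file indices per user in the coverage disc."""
    n_users = poisson(lam_u * math.pi * R ** 2, rng)   # users in B(0, R)
    requests = []
    for _ in range(n_users):
        k_x = poisson(lam_r * tau, rng)                # requests in [0, tau]
        requests.append(rng.choices(range(len(p)), weights=p, k=k_x))
    return requests

def estimate_popularity(requests, N):
    """Empirical frequencies as in (6); None when no request was observed."""
    counts = [0] * N
    for user_requests in requests:
        for i in user_requests:
            counts[i] += 1
    total = sum(counts)
    if total == 0:
        return None
    return [c / total for c in counts]

rng = random.Random(7)
p = [0.5, 0.3, 0.2]                                    # hypothetical true profile
requests = simulate_requests(p, lam_u=0.01, lam_r=0.1, R=50.0, tau=100.0, rng=rng)
p_hat = estimate_popularity(requests, N=len(p))
print(p_hat)
```

Increasing `tau`, `lam_u` or `lam_r` increases the expected number of samples, which is the mechanism through which the training time controls the estimation accuracy in the analysis that follows.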

The proposed estimator can be improved by using samples 
from other related domains, for example, a social network. 
The term “target domain” is used when samples are obtained 
only from users in the coverage area of the BS. In the 
next subsection we derive the minimum training time $\tau$,
corresponding to the estimator in (6), required to achieve the
desired estimation accuracy $\epsilon > 0$.

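A. A lower bound on the training time $\tau$

Theorem 2: For any $\epsilon > 0$, with a probability of at least
$1 - \delta$, an offloading loss of $\hat{\mathcal{T}}^* < \mathcal{T}^* + \epsilon$ can be achieved using the
estimate in (6) provided

$$\tau \geq \begin{cases} \dfrac{1}{\lambda_r p^*} \left\{ \log \dfrac{1}{1 - L/\lambda_u} \right\}^{+} & \text{if } \lambda_u > L, \\ \infty & \text{otherwise,} \end{cases} \tag{7}$$

where $\{x\}^+ = \max\{x, 0\}$, $p^* \triangleq (1 - \exp\{-2\bar{\epsilon}^2\})$,
$L \triangleq \frac{1}{\pi R^2} \log\left(\frac{2N}{\delta}\right)$ and

$$\bar{\epsilon} \triangleq \frac{R_0 \epsilon}{2B \sup_{\Pi} \sum_{i=1}^{N} g(\pi_i)}, \tag{8}$$

with $g(\pi_i) \triangleq \exp\{-\lambda_s \pi \gamma^2 [1 - (1 - \pi_i)^M]\}$.

Proof: See Appendix B. ■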

To achieve a finite training time that results in an estimation
accuracy $\epsilon > 0$, the user density $\lambda_u$ has to be greater than a
threshold. Further insights into (7) are obtained by making
the following approximation: $1 - x \leq e^{-x}$ for all $x \geq 0$. This
is combined with $\sup_{\Pi \succeq 0} \sum_{i=1}^{N} g(\pi_i) \leq N$, yielding
the following lower bound on the training time $\tau$:

$$\tau \geq \frac{2 B^2 N^2}{\pi R^2 \lambda_u \lambda_r R_0^2 \epsilon^2} \log \frac{2N}{\delta}. \tag{9}$$


The lower bound (9) enables us to make the following
observations:

(i) The training time $\tau$ to achieve an $\epsilon$-offloading loss
difference scales as $N^2$,

(ii) $\tau$ is inversely proportional to $(\lambda_u, \lambda_r)$,

(iii) as the coverage radius increases, the delay decreases as
$1/R^2$, and

(iv) as the data-file size $B$ increases, the training time scales
as $B^2$.
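
The scaling observations above can be checked numerically. The sketch below assumes that (9) reads $\tau \geq \frac{2B^2N^2}{\pi R^2 \lambda_u \lambda_r R_0^2 \epsilon^2}\log\frac{2N}{\delta}$; the function name `training_time_bound` and the parameter values are hypothetical and serve only to illustrate the dependence on each quantity.

```python
import math

def training_time_bound(N, B, R0, R, lam_u, lam_r, eps, delta):
    """Right-hand side of (9) under the form assumed in the lead-in."""
    prefactor = (2.0 * B ** 2 * N ** 2) / (
        math.pi * R ** 2 * lam_u * lam_r * R0 ** 2 * eps ** 2)
    return prefactor * math.log(2.0 * N / delta)

# Illustrative parameter values (not the ones used in the paper's figures).
base = dict(N=10, B=1e6, R0=1e6, R=1000.0, lam_u=1e-3,
            lam_r=1.0 / 360.0, eps=0.1, delta=0.02)
tau0 = training_time_bound(**base)

# Doubling B quadruples the bound; doubling R or lam_u reduces it.
tau_B = training_time_bound(**{**base, "B": 2 * base["B"]})
tau_R = training_time_bound(**{**base, "R": 2 * base["R"]})
tau_u = training_time_bound(**{**base, "lam_u": 2 * base["lam_u"]})
print(tau0, tau_B / tau0, tau_R / tau0, tau_u / tau0)
```

Doubling $N$ multiplies the bound by slightly more than four because of the additional $\log(2N/\delta)$ factor, which is consistent with the $N^2 \log N$ scaling quoted among the main findings.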

The bound in (9) is a lower bound on the training time per
request per user, since the offloading loss is derived for a given
request per user. There are on average $\lambda_r$ requests per unit
time per user. Thus, to obtain the training time per user, the
offloading loss has to be multiplied by $\lambda_r$. This amounts to
replacing $\epsilon$ by $\epsilon/\lambda_r$. Therefore, (9) becomes

$$\tau \geq \frac{2 B^2 N^2 \lambda_r}{\pi R^2 \lambda_u R_0^2 \epsilon^2} \log \frac{2N}{\delta}. \tag{10}$$


It is seen that the training time scales linearly with $\lambda_r$.
Although the training time per user per request tends to zero
as $\lambda_r \to \infty$, the training time per user tends to $\infty$. This
is because the number of requests per unit time approaches
$\infty$, and thus, a small fraction of errors results in an infinite
difference in offloading loss leading to an infinite training
time. With the increasing demand to provide higher quality of
service for the end user, the question of whether it is possible
to improve (i.e., decrease) the training time $\tau$ to achieve the
desired estimation accuracy $\epsilon$ deserves attention. In the next
subsection we show that the lower bound on the training time
can indeed be improved by employing a TL-based approach.


B. Transfer learning to improve the training time 

In practice, the minimum training time required to achieve
an estimation accuracy $\epsilon > 0$ can be expected to be very
large. An approach to overcome this drawback is to utilize 
the knowledge obtained from users’ interactions with a social 
community (termed the “source domain”). Specifically, by 
cleverly combining samples from the source domain and users’ 
request pattern (target domain), one can potentially reduce the 
training time. In fact, the estimation accuracy is indicative of 
the dependence between the source and target domains. These 
techniques are commonly referred to as TL-based approaches, 
and have implications on the training time to achieve a given 
estimation accuracy. TL-based approaches were also employed 
in and llJTl to negotiate over-fitting problems in estimating 
the content popularity profile matrix. However, unlike in ll^ 
and Ezl, in this paper we are interested in deriving the 
minimum training time to achieve a desired performance 
accuracy. Furthermore, the model we consider is quite different 
from those considered in and llJTll . 

The TL-based approach considered here comprises two
sources, namely, the source domain and target domain, from
which the samples are acquired. An estimate of the popularity
profile is obtained in a stepwise manner as follows:

(i) Using target domain samples, the following parameter is
computed at the BS:

$$S_i^{(tar)} \triangleq \sum_{x \in \mathbb{B}(0,R) \cap \Phi_u} \sum_{l=1}^{k_x} \mathbf{1}\{X_x^{(l)} = i\}, \quad i = 1, \ldots, N. \tag{11}$$

“close.” This is made precise in the following proposition, 
and a detailed discussion is provided in Section IV] 
Proposition 1: For any e > 0 and 6 G [0,1], the TL- 
based approach performs better than the source sample- 
based agnostic approach provided the number m of source 
samples satisfies m > (^) 

distributions satisfy the following condition: 


Recall that $k_x$ is the number of requests made by the
user at the location $x$. The corresponding $l$-th request by
the user at the location $x$ in the time interval $[0, \tau]$ is
denoted $X_x^{(l)}$, $l = 1, 2, \ldots, k_x$.

(ii) The source domain samples $X^s = \{X_1^s, \ldots, X_m^s\}$ are
drawn i.i.d. from a distribution $Q$, where $X_k^s = i$ ($i =
1, \ldots, N$) denotes that the user corresponding to the
$k$-th sample has requested the file $f_i$. The nature of the
distribution will be made precise in Proposition 1. Using
this, the BS computes

$$S_i^{(s)} \triangleq \sum_{k=1}^{m} \mathbf{1}\{X_k^s = i\}, \quad i = 1, \ldots, N. \tag{12}$$

(iii) The BS uses (11) and (12) to compute an estimate of
the popularity profile, denoted $\hat{p}_i^{(tl)}$
(the superscript $tl$ indicates transfer learning), given
by

l|J’-Q||oo< 


ei?n 


2BXu7T-f^N' 


(16) 


where F = 


1 — exp 


l-e 


■log(l -L) 


In fact, (16) provides the guiding principle to decide if
the samples drawn from the distribution $Q$ should be used
to estimate the distribution $\mathcal{P}$. In general, the distance
between the distributions has to be estimated from the
available samples (relative to the distribution on $\mathcal{P}$).

An estimate of the popularity profile can also be obtained
by linearly combining its estimates obtained from the source
domain and target domain samples. In particular, we have

$$\hat{p}_i = \alpha \hat{p}_i^{(s)} + (1 - \alpha) \hat{p}_i^{(t)}, \tag{17}$$
$$\hat{p}_i^{(tl)} = \frac{S_i^{(s)} + S_i^{(tar)}}{\sum_{x \in \{\mathbb{B}(0,R) \cap \Phi_u\}} k_x + m}. \tag{13}$$

Using the estimate given by (13), a lower bound on the training
time is obtained as stated in the next theorem.

Theorem 3: Let $g(\pi_i) \triangleq \exp\{-\lambda_s \pi \gamma^2 [1 - (1 - \pi_i)^M]\}$.
Then, for any accuracy

$$\epsilon > \frac{2B \sup_{\Pi} \left\{ \sum_{i=1}^{N} g(\pi_i) \right\}}{R_0} \, \|\mathcal{P} - Q\|_\infty, \tag{14}$$

with a probability of at least $1 - \delta$, an offloading loss of
$\hat{\mathcal{T}}^* < \mathcal{T}^* + \epsilon$
can be achieved using the estimate in (13) provided the training
time $\tau$ satisfies the following condition:

r>l (15) 

[ (X), otherwise. 


where p = (log _ 2e2^TO), Cpq = e - \\T - Q||oo, 
A = CT (log ^ and e- ^ 2Bsup„{ELg(^^)} - 

Proof: See Appendix C. ■

From Theorem 3, we see that under suitable conditions the
TL-based approach performs better than the source domain
sample-based agnostic approach. The following inferences are
drawn:

(1) The minimum user density to achieve a finite delay is
reduced by a positive offset $2\epsilon_{pq}^2 m$. In fact, for
$m \geq \frac{\log(2N/\delta)}{2(\bar{\epsilon} - \|\mathcal{P} - Q\|_\infty)^2}$,
a finite delay can be achieved for all user
densities, which provides a significant advantage.

(2) The finite delay achieved is smaller compared to the source
domain sample-based agnostic approach for large enough
numbers of source samples, and the distributions are

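where $\hat{p}_i^{(s)}$ and $\hat{p}_i^{(t)}$ are the estimates of the popularity profile
obtained from the source domain samples and the target
domain samples, respectively. The estimates are given by

$$\hat{p}_i^{(t)} = \frac{S_i^{(tar)}}{\sum_{x \in \{\mathbb{B}(0,R) \cap \Phi_u\}} k_x}, \tag{18}$$

$$\hat{p}_i^{(s)} = \frac{S_i^{(s)}}{m}. \tag{19}$$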

Note that, in this case, the coefficients are independent of
the realization of the network. For the estimate proposed in
(17), we have the following result:

Theorem 4: For any accuracy

$$\epsilon > \frac{2B \sup_{\Pi} \left\{ \sum_{i=1}^{N} g(\pi_i) \right\}}{R_0} \, \|\mathcal{P} - Q\|_\infty, \tag{20}$$

with a probability of at least $1 - \delta$, an offloading loss of
$\hat{\mathcal{T}}^* < \mathcal{T}^* + \epsilon$
can be achieved using the estimate in (17) provided the training
time $\tau$ satisfies the condition specified by (21), where


Pthresh — 




log' 


1 — (g^) exp{—2u;^m} 


" (1 -exp {-2p^}), gin,) ^ 

exp{—As7r7^ [l — (I — ni )^]} and uj = > q. This 

is valid for all 0 < a < min{^, l} and 0 < g < 

whereG4||T-Q||^ + ^/^logM. 


Proof: See Appendix D. ■



















, if Xu ^ Pth res j 


( 21 ) 


T > 


rQt 


log 


1 -^ 




:^)+iog 


l-(^) exp{-2i>2m} 


OO, 


Otherwise, 


IV. Parametrized Family of Popularity Profile 


In the previous sections, no structure was imposed on the
popularity profile. In practice, the popularity profile is modeled
using a parametric family of distributions such as the Zipf
distribution 1 ^, which, with a one-dimensional parameter
$\Theta \in \mathbb{R}$, is specified by
$p_{\Theta,i} = \frac{i^{-\Theta}}{\sum_{j=1}^{N} j^{-\Theta}}$, $i = 1, 2, \ldots, N$. To
obtain an estimate of the Zipf distribution it suffices to find
the parameter $\Theta$; estimating a single parameter requires fewer
samples, which can potentially reduce the training time. We
now derive bounds on the training time when the popularity
profile belongs to a parametric family of distributions. We
begin with the following assumption:

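Assumption 1: Let the family of parametrized popularity
distributions be defined by
$\mathcal{P} = \{\mathcal{P}_\Theta : \Theta \in [a, b]^d, a < b\}$.
Further, for all $\Theta \in [a, b]^d$, $\mathcal{P}_\Theta$ satisfies
$\|\partial_\Theta p_{\Theta,i}\|_2 \leq C$, where $C < \infty$ is independent of $N$,
and $\partial_\Theta p_{\Theta,i} \in \mathbb{R}^d$
denotes the sub-differential of $p_{\Theta,i}$. For example, the Zipf
distribution $p_{\Theta,i} = \frac{i^{-\Theta}}{\sum_{j=1}^{N} j^{-\Theta}}$,
$i = 1, 2, \ldots, N$, satisfies this
property.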
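
As a small illustration of the Zipf example, the sketch below computes the Zipf probabilities and their derivative with respect to the parameter, using the identity $\partial_\Theta p_{\Theta,i} = p_{\Theta,i}\left(\sum_j p_{\Theta,j}\log j - \log i\right)$, which follows from differentiating $\log p_{\Theta,i}$. The function names `zipf_pmf` and `zipf_derivative` and the numerical values are illustrative assumptions.

```python
import math

def zipf_pmf(theta, N):
    """Zipf probabilities p_i = i^(-theta) / sum_j j^(-theta), i = 1..N."""
    weights = [i ** (-theta) for i in range(1, N + 1)]
    z = sum(weights)
    return [w / z for w in weights]

def zipf_derivative(theta, N):
    """d p_i / d theta = p_i * (E[log J] - log i), with J distributed as the pmf."""
    p = zipf_pmf(theta, N)
    mean_log = sum(pj * math.log(j) for j, pj in enumerate(p, start=1))
    return [pi * (mean_log - math.log(i)) for i, pi in enumerate(p, start=1)]

p = zipf_pmf(0.8, N=50)                     # hypothetical parameter and support
grad = zipf_derivative(0.8, N=50)
print(p[:3], max(abs(g) for g in grad))
```

Evaluating the largest derivative magnitude for increasing `N` gives a quick empirical check of the boundedness required by Assumption 1, though it is of course not a proof.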

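Let the true underlying parameter be $\Theta := \{\Theta_1, \ldots, \Theta_d\}$.
Note that $\Theta_j \in [a, b]$ for all $j = 1, \ldots, d$. Let
the BS observe $n_p$ (number of requests) i.i.d. samples
$\{X_{t,1}, X_{t,2}, \ldots, X_{t,n_p}\} \in \mathcal{X}_t^{n_p}$ drawn from the distribution
$\mathcal{P}_\Theta$. Also, let
$\hat{\Theta}_{n_p,i} := (\hat{\Theta}_{n_p,i,1}, \hat{\Theta}_{n_p,i,2}, \ldots, \hat{\Theta}_{n_p,i,d}) \in \mathbb{R}^d$,
$i = 1, 2, \ldots, n_p$, denote the estimate of $\Theta$ based on a single
observation, i.e., $\hat{\Theta}_{n_p,i} = f(X_{t,i})$, $i = 1, 2, \ldots, n_p$, where
$f : \mathcal{X}_t \to [a, b]^d$ is an unbiased estimator of $\Theta$. In the
above, $n_p$ denotes the number of requests made by the users
corresponding to the BS $z$ in a time interval of $[0, \tau]$. Since
$f(\cdot)$ is an unbiased estimator of $\Theta$, we have
$\mathbb{E}[\hat{\Theta}_{n_p,i} \mid \Theta] = \Theta$
for all $i = 1, 2, \ldots, n_p$. The estimate of $\Theta$ using $n_p$ samples
is obtained as follows:

$$\hat{\Theta}_{n_p} = \frac{1}{n_p} \sum_{i=1}^{n_p} \hat{\Theta}_{n_p,i}. \tag{22}$$

Note that $\hat{\Theta}_{n_p} := (\hat{\Theta}_{n_p,1}, \hat{\Theta}_{n_p,2}, \ldots, \hat{\Theta}_{n_p,d})$ is also an
unbiased estimator of $\Theta$, i.e.,
$\mathbb{E}[\hat{\Theta}_{n_p} \mid \Theta, n_p] = \Theta$. The following
theorem provides a bound on the time complexity for a family
of parameterized popularity profiles satisfying Assumption 1.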

Theorem 5: For the family $\mathcal{P}_\Theta$ satisfying Assumption 1,
and given the estimator $\hat{\Theta}_{n_p}$, $\hat{\mathcal{T}}^* < \mathcal{T}^* + \epsilon$ for every $\epsilon > 0$
with probability at least $1 - \delta$ if


T > 


Ap(l-e—") 


log 


1 - 




log 


2d 


(23; 


for Xu > log otherwise r = oo, where cr^ 

20^ r. A Roe 

dc‘^{b-a)‘^ ana 


Proof: See Appendix E. ■

From (23), we see that the bound on the training time is
independent of $N$, and from a scaling perspective, the training
time scales with $d$, $\lambda_r$ and $\lambda_u$. This amounts to a significant
improvement compared to the nonparametric model studied in
the previous sections of this paper, where the training time is
shown to scale as $N^2 \log N$. A natural extension is to utilize
the knowledge obtained from users’ interactions with a social
community, namely, the source domain samples. In the next
subsection, we analyze the time complexity bound employing
the TL-based approach for popularity profiles modeled using
a parametric family of distributions.


A. Transfer Learning for Parametric Models 


In this subsection, we derive a lower bound on the training
time when the BS has access to the source domain samples
along with the target domain samples. Let the source domain
samples $(X_{s,1}, X_{s,2}, \ldots, X_{s,m}) \in \mathcal{X}_s^m$ be drawn i.i.d. from
$\mathcal{P}_{\Theta_s}$, where $\Theta_s \in \mathbb{R}^d$. Further, as before, we assume that
there exists $f : \mathcal{X}_s \to \mathbb{R}^d$, an unbiased estimate of $\Theta_s$.
As before, let the BS
observe $n_p$ i.i.d. target domain samples from $\mathcal{X}_t^{n_p}$ drawn from
$\mathcal{P}_\Theta$. An estimate of $\Theta$ based on the available source and target
domain samples is obtained as follows:

(i) Using the source domain samples, an estimate of $\Theta_s$,
denoted $\hat{\Theta}_s$, is obtained in a manner similar to that of the target
domain parameter $\Theta$ as explained earlier in this section.

(ii) Using the target domain samples, an estimate of $\Theta$,
denoted $\hat{\Theta}_t$, is obtained as in (22).

(iii) The two estimates are fused to get an estimate of $\Theta$ as
$\hat{\Theta}_{tl} = \lambda \hat{\Theta}_t + (1 - \lambda) \hat{\Theta}_s$, where
$\lambda \in [0, 1]$ will be described
shortly.
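
The three steps above can be sketched as follows, with per-sample parameter estimates represented as lists of length $d$. The function names `mean_estimate` and `fuse` and the sample values are hypothetical; the per-sample estimator $f$ itself is not specified in the text and is therefore taken as given input here.

```python
def mean_estimate(per_sample_estimates):
    """Coordinate-wise average of single-sample estimates, as in (22)."""
    n = len(per_sample_estimates)
    if n == 0:
        raise ValueError("at least one sample estimate is required")
    d = len(per_sample_estimates[0])
    return [sum(est[j] for est in per_sample_estimates) / n for j in range(d)]

def fuse(theta_t, theta_s, lam):
    """Fused estimate lam * theta_t + (1 - lam) * theta_s, with lam in [0, 1]."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return [lam * t + (1.0 - lam) * s for t, s in zip(theta_t, theta_s)]

# Hypothetical single-sample estimates f(X) for d = 2.
theta_t = mean_estimate([[1.0, 2.0], [3.0, 4.0]])   # target domain, step (ii)
theta_s = mean_estimate([[2.0, 2.0], [4.0, 6.0]])   # source domain, step (i)
theta_tl = fuse(theta_t, theta_s, lam=0.25)          # step (iii)
print(theta_t, theta_s, theta_tl)
```

Setting `lam=1.0` recovers the target-only estimate, so the fused estimator contains the estimator of (22) as a special case.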


Theorem 6: For the family $\mathcal{P}_\Theta$ satisfying Assumption
1, and given the estimator $\hat{\Theta}_{tl} = \lambda \hat{\Theta}_t + (1 - \lambda) \hat{\Theta}_s$,
we have $\hat{\mathcal{T}}^* < \mathcal{T}^* + \epsilon$ for every $\epsilon > 0$ with a
probability of at least $1 - \delta$ if the condition specified by
(24) is satisfied, for $\lambda_u \geq$


^(^logf+log- 

for all Dt < § - 

O — 2 - f — n \ /T^ — _O A e 

A VC ^ d(l —A)^(fc—a)^ ’ sup^5(7ri)’ 

? = If, and G := ||0 - 0 g ||2 + {b - 

Proof: See Appendix F. ■

It is important to note that the aformentioned bound is inde¬ 
pendent of N. In the following section, we provide numerical 
results to get further insights into the expressions derived in 
the paper. 


AG and 0 < A < min 




This holds 


{^,1}. Here, 
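$$\tau \ge \frac{1}{\lambda_r \left(1 - e^{-\sigma_t^2}\right)} \log \frac{1}{1 - \frac{1}{\lambda_u \pi R^2} \left( \log \frac{2d}{\delta} + \log C_1 \right)} \quad (24)$$

where $C_1$ is defined in Appendix F.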



Fig. 1: Training duration versus $N$, corresponding to Theorems
2 and 3.


V. Numerical Results 

In this section, we provide numerical results and derive
insights into the analyses carried out in the previous
sections. The parameter values used in our calculations
are as follows: B = 10^ bits, Rq = 10® bits/s, 7 = 
100m, Xu = 0.001 nodes/m^, = 1/360 requests/s. 
As = 10“® nodes/m^, S = 0.02, R = 2 Km, m = 
10 ® samples, j3e = 0.6 and /3; = 0.2. e is chosen as 
a fraction of a lower bound on the offloading loss, i.e., 
T(n,T) > exp{—As7r7^}. In particular, e = fraction x 
-^exp{-As 7 r 72 }. Further, ||T - Q||oo = 0-1 (^1^) which 
for the above parameters is of the order of 10/A^. 

Fig. 1 shows a plot of the lower bounds on the training
duration obtained in Theorems 2 and 3 as functions of the
support $N$. It is seen that, for $N < 70$, the TL-based approach
provides significant performance improvement. However, for
$N > 70$, the performance of the TL-based approach degrades
compared to the approach that uses only the target
domain samples (and, hence, can be called source domain agnostic). This
suggests that for larger values of $N$, the estimate of the
popularity profile obtained using (fTTl i performs poorly due to 
incorrect fusion of the estimates obtained from source and 
target domains. Fig. 2 shows the plots of the lower bound
in Theorem 4 corresponding to the estimate obtained by a
fixed linear combination of the source and target estimates
(see O). As seen in the figure, this does not bring any 
performance improvement and in fact sometimes performs 
poorly compared to the source domain agnostic approach. 
This is because the fixed linear combination does not have 
the flexibility to adapt to different realizations of the network, 
proving the sub-optimality of the estimate in ([TtT i compared 



Fig. 2: Training duration versus $N$, corresponding to Theorem 4.



Fig. 3: Training duration versus $m$ for a fixed $N$ (= 10).


to that in (fT 3 l l. It is also seen that the coefficients used in the 
estimate that adapts to the varying realizations of the network 
as in (Ell is beneficial. 

Fig. 3 shows a plot similar to that in Fig. 1 but with
$N = 10$ and varying $m$. It can be seen that the TL-based
approach performs better for all $m > 1000$, demonstrating its
applicability in practice. As seen, the performance is better
for higher values of the fraction, which corroborates intuition.
Fig. 4 also shows a plot of time duration versus $m$ for a fixed
N = 10. It can be seen that the estimate in (fTTl i outperforms 
the agnostic approach; however, this is observed at very high 
values of source domain samples (m = 10500 and m = 16000 
for fraction = 0.5 and fraction = 0.4, respectively). Thus, 
although the TL-based approach using the estimate (fTTl i has 






































Fig. 4: Training duration versus $m$ for a fixed $N$ (= 10),
comparing the TL-based and agnostic approaches.


some benefits, it is not desirable for practical applications. 



Fig. 5: Training duration versus 0 when the popularity profile 
is modeled using a parametric family of distributions. $C = 2$,
fraction $= 0.6$, $\|\theta - \theta_s\|_2 = 0.1$, $(b - a) = 0.5$.

The main benefits of the TL-based approach are shown in
Fig. 5 for the parametric family of popularity profiles. It can be
seen that the TL-based approach performs significantly better
than the source domain agnostic approach for values of $m$ as
low as 10. This is because the number of parameters to be
estimated scales with the dimension of $\theta$ rather than with the
support. In particular, as $d$ increases, the training duration also
increases, as expected. However, the delay scales
only linearly in $d$, as compared to the quadratic scaling experienced
with the nonparametric method.

VI. Concluding Remarks 

The popularity profile for caching in distributed heterogenous
cellular networks was estimated at the BS using the available
instantaneous demands from users in a time interval [0, t]. We
showed that the training time $\tau$ to achieve an $\epsilon > 0$ difference
between the achieved cost and the optimal cost is finite,
provided the user density is greater than a threshold; $\tau$ was
shown to scale as the square of the support of the popularity
profile. A TL-based approach was proposed to estimate the
popularity profile, and a condition was derived under which
it performs better than the approach based only on target
domain samples. Although the TL-based approach performs better, the
error that is achieved in (20) depends on $\|\mathcal{P} - \mathcal{Q}\|_\infty$, suggesting
that the smaller the distance between the two distributions, the better
the TL scheme performs. From Proposition 1, the benefits of
using target domain samples can only be realized with the
knowledge of the distance $\|\mathcal{P} - \mathcal{Q}\|_\infty$. The main benefit of the
TL-based approach is recognized when the popularity profile
is modeled using a parametric family of distributions. In this
case, the delay is independent of $N$ and scales only linearly
with the dimension of the distribution parameter. In practice,
caching depends on several factors such as the scheduling
scheme used, which in turn depends on the channel conditions,
QoS requirements, etc. An important assumption that we make
is that if the requested file is present in one (or more) of
the neighboring SBSs, the transmissions are scheduled within
a tolerable time frame. In the case of caching, this time
duration could be slightly relaxed, and can be thought of as an
abstraction of the scheduling scheme employed. If the file is
not present, regardless of the scheduling policy, the file cannot
be served locally. Hence, our approach leads to
a lower bound, albeit pessimistic, on the training time. Thus,
even under pessimistic situations, the training time scales as
$N^2 \log N$ for achieving an offloading loss that is $\epsilon > 0$ away
from the optimal offloading loss.


Appendix A
Proof of Theorem 1

The first term in (|2]i, l{/i ^ requested}, 

can be written as
$$\sum_{i=1}^{N} \exp\{-I_i\}\, p_i,$$
where $I_i = \lambda_s \pi \gamma^2 \left[1 - (1 - \pi_i)^M\right]$. In the above exposition,
(a) follows from the fact that the proposed random caching
scheme is independent across users, (b) is due to the fixed
cache size ($M$), and (c) follows since $n_s$ is Poisson distributed
with mean $\lambda_s \pi \gamma^2$, where $n_s$ is the number of SBSs in a circular
area of radius $\gamma$. This completes the proof of Theorem 1. ■
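
The closed form $\exp\{-I_i\}$ can be checked numerically. The sketch below is a Monte Carlo illustration under stated assumptions, not part of the proof: each SBS is assumed to miss file $i$ independently with probability $(1 - \pi_i)^M$, the number of SBSs within radius $\gamma$ is Poisson with mean $\lambda_s \pi \gamma^2$, and the numerical values (mean 3, $\pi_i = 0.2$, $M = 3$) are hypothetical. The function names are likewise illustrative.

```python
import math
import random


def closed_form_miss_probability(mean_sbs: float, pi_i: float,
                                 cache_size: int) -> float:
    """exp{-mean_sbs * [1 - (1 - pi_i)^M]}, the per-file term exp{-I_i}."""
    hit_per_sbs = 1.0 - (1.0 - pi_i) ** cache_size
    return math.exp(-mean_sbs * hit_per_sbs)


def sample_poisson(rng: random.Random, mean: float) -> int:
    """Knuth's multiplication method; adequate for small means."""
    threshold = math.exp(-mean)
    count = 0
    product = rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def simulate_miss_probability(mean_sbs: float, pi_i: float, cache_size: int,
                              trials: int = 100_000, seed: int = 0) -> float:
    """Fraction of trials in which no SBS in range caches file i."""
    rng = random.Random(seed)
    miss_per_sbs = (1.0 - pi_i) ** cache_size
    misses = 0
    for _ in range(trials):
        n_sbs = sample_poisson(rng, mean_sbs)
        # With zero SBSs in range the file is trivially unavailable.
        if all(rng.random() < miss_per_sbs for _ in range(n_sbs)):
            misses += 1
    return misses / trials


exact = closed_form_miss_probability(3.0, 0.2, 3)
estimate = simulate_miss_probability(3.0, 0.2, 3)
```

The agreement between `exact` and `estimate` reflects the thinning property of the Poisson distribution: the SBSs that cache file $i$ are themselves Poisson with mean $I_i$, so the probability that none is in range is $\exp\{-I_i\}$.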


Appendix B 
Proof of Theorem 2

For any e > 0, the inequality PrjT* > T* + e} < 
Pr {2supi^n^O:iT’n=i > ^1} proved, where AT = 
T(n, 7) - T(n, 7). And, t* - T* can be written as (see ||42l) 


T* - inf T(n, T) < T* - T + sup T(n, T) - T(n, T) 
n n 
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where T = T(n, T), thus proving the inequality. Sub¬ 
stituting for T(n, T) and T(n, CP) from (|4]i we get 
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bounded as follows: 
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g{'Ki) = exp{—As 7 r 7 ^ [l — (1 — }, and the last inequal 

ity follows by applying Hoeffdings inequality (see BSl l since 
the estimator !P is unbiased and tt^, tt € [0,1]. Note that, the 
expectation in (25) is with respect to $n_p$. Conditioned on the
number $n_R$ of users in the coverage area of the BS, $n_p$ is a Poisson
distributed random variable with density $n_R \lambda_r \tau$. Therefore,
2jVEX:r=o^^P{~g} = 2iVE„,,exp{-Arnflr5*}, 

where g = ( 2 e^fc + Arn^r) and g* = (l — exp {— 2 e^}) 
which can further be simplified as 


< 7VE, 
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provided e > Hy — Q||oo. where ||T — Q||c 
supjg[i jv] ki ~Pi\- From Hoeffding’s inequality. 
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where a = Arn^r exp {— 26 ^^}, Epq = e — 
\\y - Q||oo and pp, = XrnRT{l - exp{-2e2^}). 
Therefore, 2iVE„p exp {—2ep^(np + m)} = 

2 exp {— 2 ep^m} exp{—A„ 7 ri?^<}, where 
t = (l — exp{—ArT (l — exp{— 2 ep^})}), and is at most 
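$\delta > 0$ if

$$= 2N \exp\left\{-\lambda_u \pi R^2 \left(1 - \exp\{-\lambda_r \tau g^*\}\right)\right\}. \quad (26)$$

We see that $\Pr\{\sup_{\Pi} |\Delta T| > \frac{\epsilon}{2}\} \le \delta$ if (26) is upper bounded
by $\delta$, resulting in
$$\tau \ge \frac{1}{\lambda_r g^*} \log \frac{1}{1 - \frac{1}{\lambda_u \pi R^2} \log \frac{2N}{\delta}},$$
provided $\lambda_u > \frac{1}{\pi R^2} \log \frac{2N}{\delta}$; otherwise $\tau = \infty$, proving
Theorem 2. ■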


provided $\lambda_u > \frac{1}{\pi R^2} \left( \log \frac{2N}{\delta} - 2 \epsilon_{pq}^2 m \right)$; otherwise $\tau = \infty$,
thus proving Theorem 3. ■


Appendix C
Proof of Theorem 3

It is easy to see that PrIT* > T* + 
e} < Pr|supi<,<jv -Pi > e}, where 

e — -—}—rr and g( 7 ri) = 

$\exp\{-\lambda_s \pi \gamma^2 [1 - (1 - \pi_i)^M]\}$. Denote by $n_p$ the total
number of requests in the coverage area of the BS.
Conditioned on the number of users $n_R$ in the coverage area
of the BS, $n_p$ is a Poisson distributed random variable with
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and g(7ri) = exp{— As7r7^ [l — (1 — }. Each term in 

the summation can be upper bounded as shown in (29)-(32),
where (a) follows from the triangle inequality and using
$\|\mathcal{P} - \mathcal{Q}\|_\infty = \sup_i |p_i - q_i|$. Note that the inequality (a) is
valid if oj > ||? — Q||oo- Using Epi = pi and Hoeffding’s 
inequality, we have Pr | > w — ||? — Q||oo| < 
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N 


< Pr i ||0„^ - 0 II 2 sup ^ ||5pe.,i||2 > 

[ e*e[e.e„p] i=i 

< Pr/||0„ -0||1> 


fj2 


(b) 

< Pr sup 

, l<2<d 


Qn„,i - Qi 


> 




dC^} 


< dPr 


0„.,_,: - 0, 


\ 


the fact that ||0„p - 0||| = X[i=i 

2 


02 


< 


t^SUPi<,;<^ 


0rj.«.7' 0? 


(34) 

and 


, which 


Xu > ^ flog (¥)+log {ir(2^exp{-2^2„} 

is valid if (^) exp{—2d;^7Ti} < 1. This along with w — ||1P — 
Q||oo > 0 leads to the constraint stated in Theorem |4] ■ 


Appendix E 
Proof of Theorem 5

We begin with 

Pr{t* > T* + e} 
r ^ I 

< Pr<^ sup 5(7r)^ Pe^ -pe,, 


First, note that for all $i$, $\hat{\theta}_{n_p,i}$ is an unbiased estimate of $\theta_i$.
Further, $a \le \theta_j \le b$ for $j = 1, 2, \ldots, d$. Thus, by applying
Hoeffding's inequality, we have

I 

> j- < 2dE„p exp {-ripcr^} , 

(36) 

Conditioned on the number of users 


dPr 


^rip.i 02 


2 A 20"^ 


where a" = 

(denoted $n_R$) in a radius of $R$ around the BS, $n_p$ is Poisson
distributed with density $\lambda_r \tau n_R$. Using this fact in (36), we can write


dPr 


^rip.i 02 


> 




dC^} 

< 2dE„^ {E„p [exp {-ripcr^} \np] } 


= 2dE,, 




/c=0 




{nfiXr 


o —\k 


fc=0 


k\ 


where (a) follows from the fact that supg<^<]^ p(7r) = 
1, Pe^ i is the estimate of pQ^i, fl = and 

g['K) = exp{—A„7r7^ [1 — (1 — TTi)^]}. By using the re¬ 
mainder form of the Taylor series, pg, ■ = pQ^i + (0 — 

'^V ’ ’ 

0r2p)9pe*,2|Q.g[g, ], where [0, ©n^] represents the line 

joining the points 0 and 0np, leading to (recall that the i- 
th component of 0„p is denoted by Qup.i, i = 1, 2,..., d) 


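$$= 2d\, \mathbb{E}_{n_R} \exp\left\{-n_R \lambda_r \tau \left(1 - e^{-\sigma^2}\right)\right\}$$
$$= 2d \sum_{k=0}^{\infty} \exp\left\{-k \lambda_r \tau \left(1 - e^{-\sigma^2}\right)\right\} \frac{(\lambda_u \pi R^2)^k e^{-\lambda_u \pi R^2}}{k!}$$
$$= 2d \exp\left\{-\lambda_u \pi R^2 \left(1 - \exp\left\{-\lambda_r \tau \left(1 - e^{-\sigma^2}\right)\right\}\right)\right\}, \quad (37)$$

which is a monotonically decreasing function of $\tau$ for all $\tau > 0$.
Thus, the right-hand side of (37) is at most $\delta$ if
$$\tau \ge \frac{1}{\lambda_r \left(1 - e^{-\sigma^2}\right)} \log \frac{1}{1 - \frac{1}{\lambda_u \pi R^2} \log \frac{2d}{\delta}}$$
for $\lambda_u > \frac{1}{\pi R^2} \log \frac{2d}{\delta}$, proving Theorem 5. ■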
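
To get a feel for the threshold on $\tau$ obtained at the end of this proof, the bound can be evaluated numerically. The sketch below assumes the expression $2d \exp\{-\lambda_u \pi R^2 (1 - \exp\{-\lambda_r \tau (1 - e^{-\sigma^2})\})\}$ for the failure probability and inverts it for $\tau$. The function names are hypothetical, and the value $\sigma^2 = 0.5$ is an arbitrary placeholder; the remaining values follow Section V ($\lambda_u = 0.001$ nodes/m$^2$, $R = 2$ km, $\lambda_r = 1/360$ requests/s, $\delta = 0.02$).

```python
import math


def failure_probability_bound(tau: float, d: int, lam_u: float, lam_r: float,
                              radius: float, sigma_sq: float) -> float:
    """2d * exp{-lam_u*pi*R^2 * (1 - exp{-lam_r*tau*(1 - e^{-sigma^2})})}."""
    mean_users = lam_u * math.pi * radius ** 2
    inner = -math.expm1(-lam_r * tau * -math.expm1(-sigma_sq))
    return 2.0 * d * math.exp(-mean_users * inner)


def training_time_bound(d: int, delta: float, lam_u: float, lam_r: float,
                        radius: float, sigma_sq: float) -> float:
    """Smallest tau for which the bound above is at most delta.

    Returns math.inf when the user density is below the threshold
    (1 / (pi R^2)) * log(2d / delta).
    """
    if not 0.0 < delta < 1.0 or d < 1 or sigma_sq <= 0.0:
        raise ValueError("need 0 < delta < 1, d >= 1 and sigma_sq > 0")
    mean_users = lam_u * math.pi * radius ** 2
    log_term = math.log(2.0 * d / delta)
    if mean_users <= log_term:
        return math.inf
    rate = lam_r * -math.expm1(-sigma_sq)  # lam_r * (1 - e^{-sigma^2})
    return -math.log1p(-log_term / mean_users) / rate


# Section V values where legible; sigma_sq = 0.5 is a placeholder.
tau_d2 = training_time_bound(2, 0.02, 0.001, 1.0 / 360.0, 2000.0, 0.5)
tau_d4 = training_time_bound(4, 0.02, 0.001, 1.0 / 360.0, 2000.0, 0.5)
```

Because $d$ enters only through $\log(2d/\delta)$ in this expression, `tau_d4` exceeds `tau_d2` only slightly for a fixed $\sigma^2$; any stronger dependence on $d$ comes through $\sigma^2$ itself, which this sketch treats as a given input.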


Appendix F 
Proof of Theorem 6


We begin with 


PrIT* > T* + e} < Pr | sup 5 ( 71 ) V Ipg . - pe,i > e| 

[ JV >, 

^ Pe„.2-Pe.2 


(35) 


. 2=1 


> n 


where (a) follows from the Cauchy-Schwarz inequality
and Assumption 1 in Section IV, and (b) follows from


where p^ ^ is the estimate of pe_i using the TF-based 

-Rgg 


approach described in Section IIV-AI U = and 5 ( 71 ) = 


exp{—A„7r7^ [l — (1 — Tti)^\ }. Note that, ©g = A0s + (1 — 
























































Pr{T* > T* + e} 


< Pr|||4,-0||2>^| 

< Pr |a||0s - 0 II 2 + (1 - A)||0, - 0 II 2 > ^ 

= Pr|A||0s-0||2 + (l-A)||0,-0||2> 

+ Pr |a||0. - 0 II 2 + (1 - A)||4 - 0 II 2 > § n 

< Pr |a||0, - 0 II 2 > ^ - a| + Pr I ||0t - 0 II 2 > , 


(38) 

(39) 


(40) 

(41) 


A)0t. Further, from the remainder form of the Taylor series 
around 0 (true parameter), we get = pQ^i + (0ti — 

0)9pe|eg[e„_e], i= 1,2,... ,N, which implies that 


Therefore, PrlT* > T* + e} will be upperbounded by 

V 2 


N 


N 


^ (4i-0)5pe|e 


ee [ 0 , 1 . 0 ] 


(a) 


N 


< ||0ti - 0||2^||9pe|ee[e„,e]||2 


+2c?exp |—AuTri?^ ^1 — 
which is less than or equal to S if 


l-e—0} 


)}' 


i=l 


(b) 

< C'||0tl-0||2, 


r > 


■log 


A.(l-e-f) +logC'i) 


for $\lambda_u > \frac{1}{\pi R^2} \left( \log \frac{2d}{\delta} + \log C_1 \right)$,

where (a) follows from the Cauchy-Schwarz inequality and (b)
follows from Assumption 1 in Section IV. Therefore, we have
(38)-(41), where (a) follows from using
$\hat{\theta}_{tl} = \lambda \hat{\theta}_s + (1 - \lambda) \hat{\theta}_t$ followed by the triangle inequality.

Here, £ = {||0t — 0||2 < we let Dt < Vl/C. The 

first term can be expressed as follows; 


where
$$C_1 = \frac{1}{1 - \frac{2d}{\delta} \exp\left\{-\frac{2m \left(\Omega - \|\theta - \theta_s\|_2\right)^2}{(b - a)^2}\right\}}.$$


Pr<i ||0s-0||2 > 
< Pr 


1 /H 


-Di 


< Pr sup 

.l<i<d 


A \C 

{||0s-0s||2+||0s-0||2 

'0s,i-0 
2 


These, together with the conditions $\Omega > \|\theta_s - \theta\|_2$ and
$\frac{2d}{\delta} \exp\left\{-\frac{2m \left(\Omega - \|\theta - \theta_s\|_2\right)^2}{(b - a)^2}\right\} < 1$, prove Theorem 6. ■


< dPr 


0s,i - Os,^ 


> ( O -|| 0 ,- 0||2 
> (H-110,-0112) 


(42) 


where 0=1(0 — Dt) > ||0s — 0||2- However, 0^,^ € [a, h\ 
is an unbiased estimator of 0s,i- Therefore, by Hoeffding’s 
inequality, we can write 
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dPr 


0s,i - 0s 
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> (O- ||0s-0||2)‘ 


f 2m(!i-||e.-e||2)- 
““P'-- 




(43) 


Next, we have 


Pr 


{|| 0 .- 0 || 2 > 


l-Af < 


2dexp |—A„7ri?^ ^1 — exp{—ArT(l — e (44) 


where a? := 


t 2 (b-a)^(i-A)^ ’ '■^® inequality follows from 

dJTl i by replacing jdC'^ with ■ 
























